-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Fix GTDB database setup #329
Conversation
* Pin sphinx to version 6 * readthedocs build now requires installing autometa using `pip` in .readthedocs.yml * Add mocks for gdown, attrs, numpy, pandas, scipy, numba, skbio, trimap * Pin docutils between 0.18 and 0.20 * Pin sphinx_rtd_theme to 1.2
- `autometa-binning` parameter explanation is now in the same order as the commands are input - deprecated `--domain` has been replaced with `--rank-filter-name`
🐛 Fix bug with running GTDB taxonomic workflow
Why do the files need to be decompressed? |
Decompression is required to modify the fasta headers and then concatenate the sequences together for the final database. |
That's the .sh files /grep that was also modified? |
Yes, that's an independent bug fix that I found during some testing. It adds an underscore after the orf ID, preventing any partial matches. |
Possible to use zgrep on still gzip'ed files instead? |
Or pipe the output from zcat, or use the gzip module in Python |
We are doing file modification on the files to change the FASTA headers.
I am currently using the |
My confusion and suggestion of zgrep was because I thought the edits were related (in the future try to only fix a single thing in a PR or at least separate out the commits) My question is then the same as Jason's- is there a reason not to just read the files using the gzip module rather than decompress, write and then read back in Is the following code's single purpose to read some fasta files, edit the identifier and then concatenate into a single file? " single purpose" meaning no other code relies on any of the extracted files |
Yeah seems like there are too many steps.
|
Just to comment before I leave for the weekend... |
Hit submit before seeing @jason-c-kwan responded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like some formatting should be done prior to this being merged (details in the comments). Otherwise I had just a few questions & suggestions.
commit @Sidduppal additions and @WiscEvan suggestions minus changes moved to separate PR Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
…into sidd/gtdb-hotfix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
- 💚🐳🔥⬆️ Remove pins for scipy, scikit-learn and joblib - 💚 🐳 Add build schedule for Autometa docker images > This will help to more quickly identify when builds begin failing > Add `nightly` tag for scheduled build - 🐳 change user workdir to `/Autometa`
Hi, I'd like to use the gtdb database, but I'm not able to build it. Any news when this will be fixed? Cheers Bastian |
Looks like the tests are failing due to a recent issue with hdbscan and cython (scikit-learn-contrib/hdbscan#600) |
a78f0ce
to
972ddb5
Compare
Rolling back cython as suggested by some comments in
didn't work. . |
autometa/taxonomy/gtdb.py
Outdated
if line.startswith(">"): | ||
seqheader = line.lstrip(">") | ||
line = f"\n>{acc} {seqheader}" | ||
f_out.write(line) | ||
outline = f"\n>{acc} {seqheader}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor quibble. It is convention to put the newline character at the end of the line.
The newline character at the beginning looks kind of odd 🤷
outline = f"\n>{acc} {seqheader}" | |
outline = f">{acc} {seqheader}\n" |
This would also require changing
# from
seqheader = line.lstrip(">")
# to
seqheader = line.lstrip(">").strip()
Append underscore to contig id to prevent partial matches See also: #329 (comment)
🐛 Fix bug with running GTDB taxonomic workflow
🐛 Fix bug with setting up gtdb database. issue328